Biology Methods and Protocols

Oxford University Press (OUP)

Preprints posted in the last 90 days, ranked by how well they match Biology Methods and Protocols's content profile, based on 53 papers previously published here. The average preprint has a 0.08% match score for this journal, so anything above that is already an above-average fit.

1
Explainable, Lightweight Deep Learning for Colorectal Cancer Microsatellite Instability Screening in Low-Resource Settings

Adegbosin, O. T.; Patel, H.

2026-04-20 oncology 10.64898/2026.04.18.26350809 medRxiv
Top 0.1%
18.4%

Background: Microsatellite stability status determination is important for prognostication and therapeutic decision making in colorectal cancer management, but the conventional methods for this assessment are not readily available, especially in low- and middle-income countries. Deep learning (DL) models have been proposed for addressing this problem; however, potential computational cost due to model complexity and inadequate explainability may limit their adoption in low-resource settings. This study explored the potential of explainable lightweight models for detection of microsatellite instability in colorectal cancer. Methods: DL models were trained using a public dataset of colorectal cancer histology images and then used to classify a set of test images into one of two classes: microsatellite instability or microsatellite stability. The models were compared for efficiency. Gradient-weighted class activation mapping (Grad-CAM) was used to interpret the models' decision making. Results: The simpler convolutional neural network (CNN) trained from scratch had modest performance (accuracy=0.757, area under receiver-operating characteristic curve [AUROC]=0.840). With an attention mechanism added, these values increased, but specificity and sensitivity decreased. Pretrained models performed better than the ones trained from scratch, and EfficientNet_B0 had the best balance of high performance and low computational requirements (accuracy=0.936, AUROC=0.990, negative predictive value=0.923, specificity=0.953, 4,010,000 trainable parameters, 0.38 gigaFLOPs). However, a simple CNN model with an attention mechanism had the best interpretability based on Grad-CAM. Conclusion: This study demonstrated that DL models that are lightweight compared with previously proposed ones can be useful for colorectal cancer microsatellite instability screening in resource-limited settings while balancing performance and computational efficiency.

2
New three-dimensional preclinical models to understand and treat liver cancers activated for the β-catenin pathway

Bou Malham, V.; Leandre, F.; Hamimi, A.; Lagoutte, I.; Bouchet, S.; Gougelet, A.; Colnot, S.; Desbois-Mouthon, C.

2026-04-03 cell biology 10.64898/2026.04.01.715868 medRxiv
Top 0.1%
15.3%

Background & aims: Constitutive activation of the β-catenin pathway is a determining feature in the pathogenesis of two primary liver cancers, namely HCC and hepatoblastoma (HB). Activating alterations in the CTNNB1 gene and, to a lesser extent, inhibiting alterations in the APC gene are observed in 30 to 40% of HCC cases and 80 to 90% of HB cases. For both tumours, therapeutic management is far from optimal. Therefore, relevant experimental models are needed to increase our knowledge and test new therapeutic approaches. Methods: Organoids and tumouroids were established from APCΔhep and βcatΔex3 mouse models, which are clinically relevant models for β-catenin-activated HCC and mesenchymal HB. We developed a new methodological approach based on a dynamic suspension culture in a rotating bioreactor. Morphological and molecular characteristics and sensitivity to WNTinib, a treatment already successfully tested on human HCC and HB tumouroids, were evaluated by histology, immunohistochemistry, immunofluorescence, and RT-qPCR. Results: This easy-to-implement methodology allows for the rapid generation of a large number of organoids and tumouroids that are uniform in size and show no signs of cell death in their core. The robustness of the methodology is illustrated by the maintenance of the histological architecture, cell diversity and gene expression in organoids and tumouroids in comparison with the native liver tissues. In addition, the value of the HCC-derived tumouroids for evaluating cancer treatment was assessed based on their responsiveness to the β-catenin antagonist WNTinib. Conclusions: The organoids and tumouroids that we present here are new reliable in vitro cancer models, recapitulating the main features of β-catenin-driven HCC and mesenchymal HB. They can be integrated into an appropriate platform for drug screening and could enable the development of "a la carte" therapies that are urgently needed for these indications.
Impact and implications: This study addresses the critical need for representative in vitro models to investigate β-catenin-driven liver cancers. The organoids and tumouroids developed here are particularly valuable for researchers seeking robust, reproducible models that accurately reflect the cellular diversity and gene expression profiles of native liver tumours. These findings have practical applications in exploring cancer mechanisms, screening new drugs, optimizing personalized treatment strategies, and reducing reliance on animal models, which ultimately benefits patients.
Highlights:
- Easy and rapid generation of mouse liver organoids and tumouroids from β-catenin activated tumours using culture in a bioreactor
- Tumouroids preserve histology, cell diversity, and gene expression of native tissue
- HCC-derived tumouroids respond to β-catenin inhibitor WNTinib
- These reliable 3D models reduce reliance on animal experiments for drug testing

3
DentaCoPilot: An LLM-Augmented Next-Procedure Recommender for General Dentistry, Designed for Dentist Augmentation

Rodrigues, C. C.; Rebello, S. D.

2026-05-08 dentistry and oral medicine 10.64898/2026.05.07.26352635 medRxiv
Top 0.1%
12.8%

Background: Commercial dental artificial intelligence in 2026 is overwhelmingly diagnostic: caries, calculus, periapical, and bone-level detection on radiographs. The clinically harder question that follows every diagnosis (given a patient's chart and most recent procedure, what should the dentist do next) remains unsolved at general-dentistry scale. The closest published system, MultiTP (Chen et al., 2024), is a CNN-RNN restricted to partial-edentulism cases and provides neither calibrated uncertainty, structured rationale, nor an evaluation that treats the model as decision support rather than as an autonomous classifier. Methods: We introduce DentaCoPilot, a recommender that, given a structured chart, returns (i) a calibrated top-K probability distribution over Current Dental Terminology (CDT) codes for the next procedure, (ii) a verbalised confidence label, (iii) an explicit abstain flag when context is insufficient, and (iv) a chart-grounded rationale. We compare four classical baselines (frequency bigram, TF-IDF + logistic regression, XGBoost, MultiTP-style CNN-RNN) and six large-language-model (LLM) variants (Claude Haiku, Sonnet + chain-of-thought, Sonnet + retrieval, Opus + chain-of-thought, Sonnet + classical prior, Opus + classical prior) on a synthetic chart corpus of 500 patients (1,284 test examples). All LLM inference is routed through the local Anthropic Claude Code CLI; every call is logged for full audit. Results: On apples-to-apples evaluation, classical baselines reach 0.567 top-1 / 0.967 top-5; pure LLM variants trail at 0.267-0.467 top-1. Prompt-conditioning a Sonnet LLM on the classical baseline's top-10 candidates (M5) closes the gap: top-5 rises from 0.733 (pure Sonnet + chain-of-thought) to 0.933, matching classical baselines, while preserving rationale and abstention. Increasing the LLM backbone from Sonnet to Opus does not improve accuracy with or without priming. Calibration via temperature scaling and coverage-risk analysis is reported for the baselines. Conclusion: Prompt-conditioning a small LLM on a classical baseline's top-K is the most cost-effective LLM design we tested for next-procedure recommendation, and the design preserves the augmentation features that distinguish the system from an autonomous classifier. A pre-registered clinician-in-the-loop evaluation at the KLE Vishwanath Katti Institute of Dental Sciences (Belgaum, India) and a real-data evaluation on the multi-institutional BigMouth dental data repository are the next stage of work.

4
Hybrid Neural-Bayesian Belief Network Framework for Uncertainty-Aware Multimodal GBM Prediction

Jayme, A.; Heuveline, V.

2026-05-13 health informatics 10.64898/2026.05.10.26352710 medRxiv
Top 0.1%
12.1%

Background and Objective: Glioblastoma outcome prediction remains difficult because clinically relevant signals are distributed across heterogeneous imaging and genomic modalities, cohorts are small, and conventional neural predictors do not quantify their own uncertainty. This study evaluates a hybrid neural-Bayesian belief network framework for uncertainty-aware multimodal glioblastoma prediction and examines how modality selection, model family, and structure-aware regularization affect predictive performance and confidence quality. Methods: The framework was evaluated on the TCGA-GBM radiogenomic cohort using four input modalities (T1Gd, FLAIR, mRNA, and CNA), five model families, five structural-weight settings, and 15 view subsets. A secondary benchmark on the UCI Human Activity Recognition dataset was included to assess whether observed limitations were specific to the glioblastoma setting. Results: CNA features consistently reduced performance in most multimodal settings, and selective fusion excluding CNA outperformed both the full four-view baseline and imaging-only alternatives. Model families showed clear differences in uncertainty behaviour: non-Bayesian families achieved the strongest predictive accuracy, whereas the Bayesian family achieved the lowest calibration error over a narrower confidence range. Bayesian belief network regularization produced consistent directional improvements without supporting reliable structure-discovery claims, as learned graph structures were not reproducible across folds. On the secondary benchmark, the same framework achieved much higher predictive performance, indicating that the glioblastoma performance ceiling primarily reflects data limitations rather than an architectural constraint. Conclusions: In small-sample radiogenomic prediction, modality choice is at least as important as model choice, and uncertainty quality differs substantially across uncertainty-aware model families. The proposed framework provides a practical basis for comparing accuracy, calibration, modality selection, and structure-aware regularization in multimodal biomedical prediction.

5
The use of generative artificial intelligence applications by undergraduate dental students

Brondani, M.; Garbin, J. R.; Soheilipour, S.; Lee, V.

2026-06-02 dentistry and oral medicine 10.64898/2026.05.25.26353910 medRxiv
Top 0.1%
10.6%

Background: Higher education has been transformed by the rapid integration of generative artificial intelligence (GenAI) tools into academia. The objective of the present study was to examine how and for what purposes senior undergraduate dental students use GenAI tools in academic assignments. Methods: This cross-sectional study uses data from three written assignments submitted by two consecutive cohorts of graduating fourth-year dental students at the Faculty of Dentistry at the University of British Columbia, for a total of 120 students. The assignments focused on different subjects where students had to offer their views, including community water fluoridation. Students were asked to disclose whether and how GenAI tools were used, and for what purpose. Descriptive statistics (e.g., means, frequencies, and proportions) were computed in IBM SPSS Statistics (Version 27.0). Results: Across the two cohorts, 102 students (85%) disclosed the use of GenAI tools in at least one assignment; of these, 69 (67.6%) reported using these tools in all three assignments. ChatGPT was by far the most frequently used GenAI tool, reported by 89 students (87.2%). Nine students (8.8%) did not specify which tool they had used. The majority of the students (91.2%, n = 93) reported using GenAI for proofreading or grammatical editing. About 9.8% of the students (n = 10) reported more substantive uses, such as relying on GenAI to generate the assignment in part or in full, and/or to assess the credibility of references. Conclusions: In our study, the use of GenAI tools was highly prevalent among senior undergraduate dental students for editorial purposes. A smaller but notable proportion of students engaged in more substantive uses that may carry academic and ethical risks. There is a need for structured AI literacy training and clear, dentistry-specific guidelines to promote responsible and transparent use while safeguarding critical thinking, academic integrity, and professional judgment in dental education.

6
AENEAS Project: First real-time intraoperative application of machine vision-based anatomical guidance in neurosurgery

Sarwin, G.; Ricciuti, V.; Staartjes, V. E.; Carretta, A.; Daher, N.; Li, Z.; Regli, L.; Mazzatenta, D.; Zoli, M.; Seungjun, R.; Konukoglu, E.; Serra, C.

2026-04-11 surgery 10.64898/2026.04.09.26348607 medRxiv
Top 0.1%
10.5%

Background and Objectives: We report the first intraoperative deployment of a real-time machine vision system in neurosurgery, derived from our previous anatomical detection work, automatically identifying structures during endoscopic endonasal surgery. Existing systems demonstrate promising performance in offline anatomical recognition, yet so far none have been implemented during live operations. Methods: A real-time anatomy detection model was trained using the YOLOv8 architecture (Ultralytics). Following training completion in the PyTorch environment, the model was exported to ONNX format and further optimized using the NVIDIA TensorRT engine. Deployment was carried out using the NVIDIA Holoscan SDK, and the system ran on an NVIDIA Clara AGX developer kit. We used the model for real-time recognition of intraoperative anatomical structures and compared its output with the same video labelled manually as reference. Model performance was reported using the average precision at an intersection-over-union threshold of 0.5 (AP50). Furthermore, end-to-end delay from frame acquisition to the display of the annotated output was measured. Results: A mean AP50 of 0.56 was achieved. The model demonstrated reliable detection of the most relevant landmarks in the transsphenoidal corridor. The mean end-to-end latency of the model was 47.81 ms (median 46.57 ms). Conclusion: For the first time, we demonstrate that clinical-grade, real-time machine-vision assistance during neurosurgery is feasible and can provide continuous, automated anatomical guidance from the surgical field. This approach may enhance intraoperative orientation, reduce cognitive load, and offer a powerful tool for surgical training. These findings represent an initial step toward integrating real-time AI support into routine neurosurgical workflows.
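
For readers unfamiliar with the AP50 metric named in this abstract, the matching criterion behind it can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `(x1, y1, x2, y2)` box format and the function names `iou` and `is_match_at_50` are assumptions made for the example. A predicted box counts as a correct detection at AP50 when its intersection-over-union with a reference box is at least 0.5.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(predicted, reference):
    """True when a prediction overlaps the reference box with IoU >= 0.5."""
    return iou(predicted, reference) >= 0.5
```

Average precision itself is then computed over the ranked list of such matches per class; that step is omitted here.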

7
Real-time Computer Vision Assisted Navigation for Endoscopic Pituitary Surgery: Iterative Development and Comparative Preclinical Evaluation

Khan, D. Z.; Mao, Z.; Hudson, G.; Wijekoon, A.; Chen, J.-e.; Borg, A.; Dorward, N.; Blandford, A.; Clarkson, M.; McCulloch, P.; Bano, S.; Stoyanov, D.; Marcus, H.

2026-06-04 surgery 10.64898/2026.06.02.26354760 medRxiv
Top 0.1%
8.8%

Background: Endoscopic pituitary surgery involves navigating high-stakes anatomy where complications, such as carotid artery injury, cause devastating morbidity. While computer vision AI offers potential for real-time anatomical recognition to mitigate these risks, successful translation requires rigorous human-factors and performance evaluation. We present the iterative development and preclinical evaluation of a surgeon-controlled, real-time AI-assisted navigation system. Methods: Guided by IDEAL Stage 0 and DECIDE-AI frameworks, the study was conducted in two phases. Phase 1 was an exploratory study where surgeons used the system during high-fidelity simulated surgery and provided feedback via "Think Aloud" protocols and surveys. Following prototype iteration, a Phase 2 randomized crossover comparative trial was conducted with 19 neurosurgeons (15 trainees, 4 experts) performing high-fidelity simulated tumour resections with and without AI assistance, separated by a minimum 2-week washout. The primary outcome was surgical technical performance (OSATS). Workload, educational value, usability, trust, and implementation outcomes were also assessed. Results: Phase 1 informed hardware, model, and interface refinements, including optimized pedal-controlled overlays and prediction confidence metrics. In the comparative trial, AI assistance significantly improved overall technical performance (OSATS 19.79 ± 4.06 vs. 17.32 ± 4.11; p=0.027). This gain was experience-dependent; AI significantly augmented trainee performance (19.20 ± 3.76 vs. 16.60 ± 3.78), narrowing the proficiency gap, while expert performance remained high and stable. All participants (100%) identified the system as a useful training tool. However, subjective workload was significantly higher in the AI arm (SURG-TLX 26.42 ± 9.56 vs. 22.26 ± 7.81; p=0.014). Despite this, usability (SUS 75.13 ± 14.31) and implementation feasibility, acceptability, and appropriateness scores were consistently high (means >4.4/5). Conclusions: This study provides a stepwise process for real-time AI development using pituitary surgery as a high-stakes exemplar. The refined surgeon-centric AI system improves training and technical performance, particularly for trainees. Next steps involve first-in-human studies and further exploration of longer-term human factors such as over-reliance, cognitive overload mitigation and trust calibration.

8
When clinical prediction models do not generalize: a simulation study in liver transplantation

Brulhart, D.; Magini, G.; Schafer, A.; Schwab, S.; Held, U.

2026-03-20 health informatics 10.64898/2026.03.19.26348780 medRxiv
Top 0.1%
8.7%

Objectives: Clinical prediction models estimate the risk of a future outcome in patients. Such models are often externally validated using independent datasets; however, even when a model has been rigorously validated in a new setting and patient population, its performance across other clinical settings remains unclear. Therefore, we systematically evaluated model performance and clinical utility across diverse patient populations to quantify the limits of transportability. Methods: Using liver transplantation as an example, we used the UK donation-after-circulatory-death (DCD) risk score and descriptive statistics from Swiss DCD liver transplant populations to simulate realistic target populations with varying donor and recipient characteristics. The risk score's ability to predict one-year graft failure was evaluated using calibration intercept, calibration slope, area under the receiver operating characteristic (ROC) curve, and net benefit. Results: The UK DCD Risk Score's performance depended heavily on the simulated population characteristics. While the score performed adequately in settings similar to those where it was derived, it was not satisfactory in others. Discussion: The study showed, using a risk score in liver transplantation as an example, that the application of a prediction model can be limited in certain external populations when they differ, and that its transportability in new settings is not guaranteed. Conclusion: This study highlights the importance of external validation of clinical prediction models to determine transportability to various target populations. Their application requires careful consideration and potential model re-estimation.
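
Of the four measures listed in this abstract, net benefit is the least self-explanatory, so a small sketch may help. It uses the standard decision-curve definition, net benefit = TP/n - (FP/n) * pt/(1 - pt), where pt is the risk threshold above which a patient would be treated as high risk. The function name `net_benefit` and the toy inputs are assumptions for illustration only; this is not the authors' simulation code.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predicted risks >= threshold.

    y_true: iterable of 0/1 outcomes (1 = event, e.g. graft failure).
    y_prob: iterable of predicted risks in [0, 1].
    threshold: risk threshold pt, strictly between 0 and 1.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    pairs = list(zip(y_true, y_prob))
    n = len(pairs)
    # True positives: flagged as high risk and the event occurred.
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    # False positives: flagged as high risk but no event occurred.
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)
```

A model is clinically useful at a given threshold only if its net benefit exceeds that of treating everyone and of treating no one (net benefit 0).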

9
A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Skobelev, K.; Fithian, E.; Baranovski, Y.; Cook, J.; Angara, S.; Otto, S.; Yi, Z.-F.; Zhu, J.; Donoho, D. A.; Han, X. Y.; Mainkar, N.; Masson-Forsythe, M.

2026-03-28 surgery 10.64898/2026.03.26.26349455 medRxiv
Top 0.1%
8.6%

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks (including multimodal data integration, human interaction, and physical effects), generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

10
Video-based Detection of Delirium in Hospitalized Adults

Mendu, M.; Tesh, R. A.; Pellerin, K.; Steward, G. E.; Cerda, I. H.; Williams, M.; Colman, M.; Shah, S.; Lam, A. D.; Cash, S. S.; Westover, M. B.; Kimchi, E. Y.

2026-05-13 geriatric medicine 10.64898/2026.05.11.26352902 medRxiv
Top 0.1%
8.6%

Delirium, a dynamic neuropsychiatric condition associated with morbidity and mortality, remains underdiagnosed due to reliance on subjective, intermittent screening tools. Objective and potentially continuous identification is needed to improve clinical care. We developed and validated an analytic framework for delirium classification based on automatically extracted video features. In this prospective cohort study, patients (≥18 years) admitted to the inpatient medical or neurological ward of a tertiary academic center between August 2020 and March 2022 with an expected stay longer than one night were enrolled. Daily structured delirium assessments and brief video recordings were performed in consenting patients. Videos were analyzed using deep learning pose estimation to extract keypoints and calculate behavioral features based on eye, face, and limb postures and movements. Four machine learning models (logistic regression, gradient boosting, support vector machines, and random forests) were trained to predict delirium status from extracted features. Model performance was evaluated on 20 repetitions of three-fold cross-validation using the area under the receiver operating characteristic curve (AUC ROC). The cohort included 109 videos from 25 male and 25 female participants (median age: 72, IQR: 63.25-78). Twenty videos (18%) were from patients with delirium. Keypoints for this dataset were more accurately extracted using a customized ResNet-101 model developed with DeepLabCut (sensitivity 0.94, specificity 0.89, compared to human-labeled gold standards) than using off-the-shelf models. Keypoints were then used to generate behavioral features summarizing movement and postures throughout the video. A support vector machine model achieved an average delirium classification AUC ROC of 0.79 (SD ± 0.09), sensitivity of 0.71 (SD ± 0.16), and specificity of 0.78 (SD ± 0.07). This study demonstrates the feasibility of identifying delirium using brief videos in clinically heterogeneous cohorts and reveals novel features for objective identification.
Author Summary: Delirium is a sudden change in attention and awareness that commonly affects hospitalized patients. It is linked with longer hospital stays, cognitive decline, and death. Patients with delirium often show changes in movements and behaviors such as slowed movement, restlessness, or excessive scanning of the environment. Since current screening tools rely on intermittent human interactions, they can be subjective and miss the fluctuating nature of delirium, leading to underdiagnosis. We sought to explore whether short video recordings could be used to detect delirium automatically. In our study, we enrolled 50 hospitalized patients and conducted daily delirium assessments and video recordings. We used a machine learning model to analyze patients' eye movements, facial expressions, and body postures. We found that video-derived features could be used to identify delirium in a small clinical cohort. While needing further validation in outside cohorts, this study shows an important proof-of-concept for objective delirium monitoring in heterogeneous clinical contexts without adding burden to clinical staff.

11
Multi-Stain Fusion of Histopathology Images Using Deep Learning for Pediatric Brain Tumor Classification

Spyretos, C.; Tampu, I. E.; Lindblad, J.; Haj-Hosseini, N.

2026-04-14 pathology 10.64898/2026.04.10.717785 medRxiv
Top 0.1%
8.5%

Abstract: The classification of pediatric brain tumors is investigated using deep learning on hematoxylin and eosin (H&E) and antigen Ki-67 (Ki-67) whole slide images (WSIs) from the Children's Brain Tumor Network (CBTN) dataset. A total of 1,662 unregistered WSIs (1,047 H&E and 615 Ki-67 images) were analyzed, including low-grade glioma/astrocytoma (grades 1, 2) (LGG), high-grade glioma/astrocytoma (grades 3, 4) (HGG), medulloblastoma (MB), ependymoma (EP) and ganglioglioma. The aim of this study was to effectively classify pediatric brain tumors using H&E and Ki-67 WSIs individually, and to investigate whether early, intermediate, and late fusion could improve the predictive performance. From each WSI, 224 × 224 pixel patches were extracted, and the instance (patch)-level features were obtained using the histology foundation model CONCHv1_5. The instances were aggregated using clustering-constrained attention multiple instance learning (CLAM) for patient-level classification. Model interpretability and explainability were assessed through attention heatmaps, cell density and Ki-67 labelling index (LI) maps. In the binary grade classification between LGG and HGG, the intermediate concatenation fusion achieved the best performance with a balanced accuracy of 0.88 ± 0.05 (p < 0.005) compared to the single-stain models (H&E: 0.84 ± 0.05, Ki-67: 0.86 ± 0.05). For the 5-class tumor type classification, the one-hidden-layer late fusion learning model achieved the highest balanced accuracy of 0.83 ± 0.04 (p < 0.005), outperforming the single-stain models (H&E: 0.77 ± 0.05, Ki-67: 0.74 ± 0.05). Overall, most of the fusion approaches outperformed the single-stain models in both classification tasks (p < 0.005). The Ki-67 attention maps demonstrated moderate to strong Spearman correlation (ρ = 0.576-0.823) with the cell density and Ki-67 LI maps, suggesting that these features are associated with the models' predictions, although additional features may contribute. The results show that H&E and Ki-67 images provide complementary information, and most of the multi-stain fusion approaches using deep learning improve pediatric brain tumor diagnosis.

12
Development and Validation of a Two-Stage NLP-LLM System for Automated Extraction of Deprescribing Recommendations from Discharge Summaries

Fujita, K.; Matheson, M.; Valecha, B.; Hilmer, S. N.

2026-04-30 geriatric medicine 10.64898/2026.04.29.26352010 medRxiv
Top 0.1%
8.4%

Introduction: Polypharmacy in older adults is associated with increased risks of adverse drug events and functional decline. Discharge summaries often contain deprescribing recommendations, but these are frequently overlooked due to documentation complexity. Objective: To develop and validate a two-stage hybrid system combining rule-based natural language processing (NLP) and a large language model (LLM) for automated extraction of deprescribing recommendations from discharge summaries. Methods: This retrospective cohort study included 850 discharge summaries from patients aged ≥65 years with hospitalisation ≥48 hours across six public hospitals in New South Wales, Australia. Model 1 (rule-based NLP) extracted discharge medications and candidate sentences containing pre-defined deprescribing keywords. Model 2 (open-source LLM) classified candidate sentences into five categories. Data were split into training (80%) and test (20%) sets. Gold standard classifications were established by independent reviews, followed by adjudication of discrepancies. Results: Model 1 extracted 9,631 discharge medications (median 11 per patient) and 1,061 candidate sentences from 850 patients (median age 82.8 years). Model 2 achieved an F1 score of 0.91 and accuracy of 0.90 on the test set. Inter-rater reliability showed substantial agreement (Cohen's kappa = 0.70). The most frequently identified medications recommended for deprescribing were antibiotics and opioids. The most common misclassification was incorrectly identifying actions completed during hospitalisation as post-discharge recommendations. The combined processing time averaged 12.6 seconds per discharge summary. Conclusions: A two-stage hybrid approach combining rule-based NLP and an open-source LLM can accurately extract deprescribing recommendations from discharge summaries, enabling cost-efficient, privacy-compliant local deployment.
Key Points:
- A two-stage system combining rule-based NLP and an open-source LLM extracted and classified deprescribing recommendations from 850 discharge summaries, achieving an F1 score of 0.91 and accuracy of 0.90.
- The use of an open-source LLM (Llama 3.3) enables cost-efficient, privacy-compliant local deployment in healthcare institutions.
- Antibiotics and opioids were the most frequently identified medications recommended for deprescribing in discharge summaries.

13
Elder-Sim: A Psychometrically Validated Platform for Personality-Stable Elderly Digital Twins

Wang, J.; Yang, Z.; Zhu, Z.; Zhu, X.; Huang, Z.; Wang, H.; Tian, L.; Cao, Y.; Qu, X.; Qi, X.; Wu, B.

2026-03-30 geriatric medicine 10.64898/2026.03.25.26349036 medRxiv
Top 0.1%
8.4%

Background: LLMs enable patient-facing conversational agents, creating a pathway toward digital twins that capture older adults' lived experiences and behavioral responses across time. A central barrier is personality drift (inconsistent trait expression across repeated interactions), which undermines reliability of generated trajectories and intervention-response simulation in geriatric care. Objective: To develop ELDER-SIM, a multi-role elderly-care conversational platform for building personality-stable digital twin agents, and to propose a psychometric validation framework for quantifying personality consistency in LLM-based agents. Methods: ELDER-SIM was implemented via n8n workflow orchestration with local LLM inference (Ollama/vLLM), integrating (1) Big Five (OCEAN) trait specifications, (2) a Cognitive Conceptualization Diagram (CCD) grounded in Beck's CBT framework, and (3) a MySQL-based long-term memory module. Ablation studies across four conditions, namely Baseline, +Memory, +CCD, and +LoRA (fine-tuned on 19,717 instruction pairs from CHARLS), were evaluated via Cronbach's α, ICC, and role discrimination accuracy. Results: Personality measurement reliability was acceptable to excellent across conditions (Cronbach's α: 0.70-0.94), with consistently high test-retest stability (ICC: 0.85-0.96). Role discrimination improved stepwise from 83.3% (Baseline) to 88.9% (+Memory), 94.4% (+CCD), and 97.2% (+LoRA). CCD produced the largest gain in internal consistency (mean α 0.702 → 0.892), while LoRA achieved the highest overall internal consistency (α = 0.940) and ICC (0.958). Conclusions: ELDER-SIM provides a psychometrically validated approach for constructing personality-consistent elderly digital twin agents. Structured cognitive modeling and domain adaptation reduce personality drift, supporting reliable longitudinal simulation for elderly mental health care and reproducible in silico evaluation before clinical deployment.
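
The internal-consistency figures in this abstract are Cronbach's α values. As a reminder of what that statistic measures, here is a minimal sketch using the textbook formula α = k/(k-1) * (1 - Σ item variances / variance of total scores). The function name `cronbach_alpha` and the row-per-respondent layout are assumptions for illustration; the authors' actual scoring pipeline is not described in the abstract.

```python
import statistics


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of item scores.

    scores: list of rows, one row per respondent (or per repeated run),
            each row holding one numeric score per questionnaire item.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [statistics.variance(row[i] for row in scores) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = statistics.variance(sum(row) for row in scores)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

Values near 1 indicate that the items move together across respondents; values around 0.70 are conventionally read as the lower bound of acceptable reliability.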

14
Quantifying PD1 saturation by PDL1 in tumor tissue using a novel RNA aptamer-based assay

Veeramani, S.; Yin, C.; Yu, N.; Coleman, K. L.; Smith, B. J.; Weiner, G. J.

2026-04-08 immunology 10.64898/2026.04.06.716702 medRxiv
Top 0.1%
7.1%

Background: Therapeutic agents targeting the PD1-PDL1 interaction are of great clinical value; however, accurately predicting which patients are most likely to benefit is challenging. Improved predictive biomarkers for anti-PD1 therapy are clearly needed. Quantifying PD1 saturation by PDL1 in tumor tissue has the potential to serve as such a biomarker. Here we report a novel bioassay called the PD1 Ligand Receptor Complex Aptamer (LIRECAP) assay and demonstrate that it can be used to quantify the saturation of PD1 by PDL1 in formalin-fixed paraffin-embedded (FFPE) tumor biospecimens. Results: The PD1 LIRECAP assay was developed by identifying a pair of RNA aptamers. One aptamer preferentially binds to unoccupied PD1 (P aptamer) and the other to the PD1-PDL1 complex (C aptamer). P and C aptamers were added together to a formalin-fixed sample, and the bound aptamers were extracted. A 2-color qRT-PCR assay using a single set of primers was used to determine the ratio of sample-bound C to P aptamers (C:P ratio), which reflected PD1 saturation by PDL1 in the sample. PD1 saturation by PDL1 as determined by the PD1 LIRECAP assay correlated closely with PD1-mediated signaling and PD1-PDL1 proximity. Analysis of sarcoma FFPE biospecimens confirmed that the assay is technically reproducible on clinical biospecimens. There were significant differences in PD1 saturation by PDL1 between patients as well as considerable intratumoral heterogeneity. Conclusions: The PD1 LIRECAP assay is a novel assay that can be used to quantify PD1 saturation by PDL1 in clinical biospecimens. The assay is technically feasible, reproducible, and has the potential to serve as a superior predictive biomarker for PD1/PDL1-based therapy. Similar assays based on this platform could be used in other systems and settings to quantify the interaction between two molecules.

15
Comparison of foundation models and transfer learning strategies for diabetic retinopathy classification

Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.

2026-04-20 health informatics 10.64898/2026.04.17.26351092 medRxiv
Top 0.1%
6.8%

Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 achieving the best performance under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of the fine-tuning strategy (AUROC 0.70-0.85), though fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. While the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need to improve calibration and the comprehensive evaluation of foundation models, which are essential in clinical settings. Author summary: Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, with a focus not only on accuracy but also on how reliably they estimate disease risk. In this study, we compared different types of foundation models using two independent datasets from Brazil. We found that while these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated. In other words, the estimated probabilities did not consistently reflect the true likelihood of disease.
We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.

16
Calibrating trust in AI-assisted pituitary surgery

Hudson, G. R.; Khan, D. Z.; Fayez, F.; Bhatia, S.; Bano, S.; Costanza, E.; Blandford, A.; Stoyanov, D.; McCulloch, P.; Marcus, H. J.; University College London Collaborators

2026-06-04 surgery 10.64898/2026.06.02.26354735 medRxiv
Top 0.1%
6.7%

Background: Endoscopic endonasal transsphenoidal surgery (EETS) requires navigation around neurocritical anatomy. Today, artificial intelligence clinical decision support systems (AI-CDSSs) can orientate surgeons, but clinician trust in AI remains unclear, limiting safe deployment. This study evaluates how modifiable design affects trust and performance in a real-world pituitary surgery AI-CDSS. Method: In an online study, 70 clinicians with pituitary surgery experience were randomised evenly to a Basic or an Enhanced AI-CDSS, each of which outlines the sella on EETS operative video. The Enhanced group additionally received an explanation of the model and previous publications, alongside confidence labels depicting outline reliability. Both groups annotated the sella on six video clips, first alone and then with the optional AI-CDSS. Clips were ordered by declining AI performance, except for the final clip. Self-reported trust was measured on a 1-7 scale after each annotation, and performance was the Dice overlap between user annotations and the ground truth. Comparisons used Mann-Whitney U and permutation analysis. Results: Sixty-four participants (91%) finished the exercise (31 Basic, 33 Enhanced). When the AI performed best, median trust was 5.00 in both arms (U=559, p=.521). However, when the AI performed worst, trust was significantly lower in the Enhanced group (3.00 vs 3.67, U=668, p=.035), a difference sustained in the final clip (3.67 vs 4.33, U=687, p=.019). User performance improved with the AI-CDSS, but with no significant difference between the groups on the clips where the AI performed best or worst. Nevertheless, for the best-performing AI clip, senior clinicians had higher median performance in the Enhanced group (0.95 vs 0.90, U=75, p=.066). There was also less dispersion in the Enhanced group when the AI was inaccurate (IQR: 0.07 vs 0.21, p=.004).
Conclusion: Interface design can improve trust calibration in a surgical AI-CDSS, and may improve performance among senior clinicians when the AI is accurate and consistency when the AI is inaccurate. In future, these features may form important safety checks during translation to the operating room.
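The performance metric in the abstract above is the Dice overlap between a user annotation and the ground truth, $\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$. A minimal sketch follows, assuming annotations are available as equally sized binary masks (nested lists of 0/1). The function name, the mask representation, and the choice to return 1.0 when both masks are empty are illustrative assumptions, not details taken from the study.

```python
def dice_overlap(mask_a, mask_b):
    """Dice coefficient between two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; both masks must have the same shape.
    """
    if len(mask_a) != len(mask_b) or any(
        len(ra) != len(rb) for ra, rb in zip(mask_a, mask_b)
    ):
        raise ValueError("masks must have the same shape")

    # Collect the coordinates of foreground pixels in each mask.
    a = {(r, c) for r, row in enumerate(mask_a) for c, v in enumerate(row) if v}
    b = {(r, c) for r, row in enumerate(mask_b) for c, v in enumerate(row) if v}

    if not a and not b:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Toy masks (made up): the user outline covers 2 pixels, the ground truth 1.
user = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
score = dice_overlap(user, truth)
```

A score of 1 means the two outlines coincide exactly and 0 means they share no pixels.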

17
Analysis and Mitigation of Equipment-induced Shortcuts in AI Models for Laparoscopic Cholecystectomy

Protserov, S.; Repalo, A.; Mashouri, P.; Hunter, J.; Masino, C.; Madani, A.; Brudno, M.

2026-04-24 surgery 10.64898/2026.04.22.26351545 medRxiv
Top 0.1%
6.6%

Machine learning models have had considerable success in the medical image segmentation domain. However, one of the challenges they face is confounders, or shortcuts: spurious correlations or biases in the training data that affect the resulting models. One example of such confounders in surgical machine learning is the setup of surgical equipment, including tools and lighting. Using the identification of safe and dangerous zones of dissection in laparoscopic cholecystectomy images and videos as a use case, we inspect two equipment-induced biases: the presence of surgical tools in the field of view and the position of lighting. We propose methods for evaluating the severity of these biases and augmentation-based methods for mitigating them. We show that our tool bias mitigations improve the models' consistency under tool movements by 9 percentage points in the most inconsistent cases, and by 4 percentage points on average. Our lighting bias mitigations reduce the fraction of true dangerous-zone pixels that may be predicted as safe under lighting changes from 5% to 1.5%, without compromising segmentation quality.

18
The impact of non-invasive prehabilitation before surgery on emotional well-being in neuro-oncology patients: Insights from the Prehabilita project

Brault-Boixader, N.; Roca-Ventura, A.; Delgado-Gallen, S.; Buloz-Osorio, E.; Perellon-Alfonso, R.; Hung Au, C.; Bartres-Faz, D.; Pascual-Leone, A.; Tormos Munoz, J. M.; Abellaneda-Perez, K.; Prehabilita Working Group

2026-04-12 oncology 10.64898/2026.04.08.26350382 medRxiv
Top 0.1%
6.6%

Prehabilitation (PRH) is a preoperative process aimed at optimizing patients' functional capacity to improve surgical outcomes and overall well-being. While its physical and cognitive benefits are increasingly documented, its emotional impact, particularly in neuro-oncology patients, remains less explored. This study assessed the psychological effects of a PRH program on 29 brain tumor patients. The primary outcome, emotional well-being, was measured using quality of life and emotional distress metrics. Secondary outcomes included perceived stress levels and control attitudes. Additionally, qualitative data from structured interviews provided further insights into the psychological effects of the intervention. The results indicated significant improvements in quality of life and reductions in emotional distress, particularly among women. While perceived stress levels remained stable, control attitudes showed an increase. Qualitative analysis further highlighted the positive changes in the sense of control and identified additional factors, such as the importance of social support sources during the PRH process. Overall, these findings suggest that PRH interventions play a significant role in enhancing emotional well-being among neuro-oncological patients in the preoperative phase. These results underscore the importance of implementing comprehensive and personalized PRH approaches to optimize clinical status both before and after surgery, thereby promoting sustained psychological benefits in this population. This study is based on data collected at Institut Guttmann in Barcelona in the context of the Prehabilita project (ClinicalTrials.gov identifier: NCT05844605; registration date: 06/05/2023).

19
Detection of Hepatocellular Carcinoma from B-Mode and Contrast-Enhanced Ultrasound Using a Dual-Path Convolutional Network

Obeti, F.; Asiku, R. A.

2026-05-05 oncology 10.64898/2026.05.04.26352359 medRxiv
Top 0.1%
6.5%

Background: Hepatocellular carcinoma (HCC) is a leading cause of cancer-related mortality worldwide, with particularly severe consequences in sub-Saharan Africa where access to advanced diagnostic imaging remains limited. Ultrasound is the most widely available imaging modality in low-resource settings, yet its sensitivity for detecting early-stage HCC remains insufficient when used in conventional B-mode alone. Methods: We present a dual-path convolutional neural network (CNN) that jointly analyzes B-mode and contrast-enhanced ultrasound (CEUS) images for automated HCC detection. The model processes 1,057 labeled liver ultrasound images from 85 patients sourced from The Cancer Imaging Archive, a publicly available single-center dataset. A preprocessing pipeline extracts liver-centered regions of interest from heterogeneous DICOM files, including automatic separation of dual-panel B-mode and CEUS frames. Each imaging modality is processed through a dedicated ResNet-34 backbone initialized with ImageNet weights, and the resulting feature embeddings are fused through a late-fusion classification head. The model is evaluated using patient-wise five-fold cross-validation and a held-out 20% patient-level test set. Results: On the held-out test set, the model achieved 94.2% accuracy, 93.6% precision, 100% sensitivity, 83.3% specificity, and a 96.7% F1-score for binary HCC versus non-HCC classification. Cross-validation analysis showed consistently high discrimination across folds, with AUC values ranging from 0.93 to 0.98. Training dynamics indicated that early stopping typically activated between epochs seven and eleven, with validation loss closely tracking training loss and no evidence of severe overfitting under the chosen regularization scheme.
Conclusions: These findings demonstrate that a relatively lightweight multimodal CNN, trained on carefully preprocessed public data, can provide strong imaging-level discrimination between HCC and non-HCC findings within a single-center dataset. However, the small sample size, pronounced class imbalance, and single-center origin of the data preclude any claims of clinical utility at this stage. This work is a transparent, reproducible methodological baseline intended to support future multi-site validation, particularly in African and other low-resource clinical settings where ultrasound-based decision support could have the greatest impact.

20
Healthcare workers' acceptance of artificial intelligence in cardiac diagnosis: Implications for medical education and training programs

Hussein, G.; AlShammri, M.; Aldosari, M.; Alshehri, R.; Almasari, G.; Alabdulrahman, R.; Alarfaj, R.; Alrashed, A.; Al-Walah, M. A.

2026-05-10 cardiovascular medicine 10.64898/2026.05.06.26352604 medRxiv
Top 0.1%
6.3%

The integration of artificial intelligence (AI) in cardiology requires healthcare worker acceptance for successful implementation. Understanding attitudes and educational needs is crucial for developing effective training programs. A cross-sectional survey was conducted among 408 healthcare workers treating cardiac diseases in Riyadh, Saudi Arabia. We assessed AI acceptance, knowledge levels, and training preferences using validated scales. Statistical analyses included descriptive statistics, chi-square tests, correlation analysis, reliability testing, and multiple logistic regression. Of 408 participants, 407 provided complete responses. The sample comprised predominantly young (87.0% aged ≤30), female (75.7%) medical residents (89.9%) with limited AI experience (86.7% never used AI clinically). Internal consistency was excellent (Cronbach's α = 0.892). Moderate acceptance was observed: 49.9% were aware of AI applications in cardiology, 46.7% were willing to learn, and 42.8% were willing to use AI clinically. However, 49.1% acknowledged lacking sufficient AI knowledge. Logistic regression identified willingness to learn (OR = 3.24, 95% CI: 2.15-4.89) and training interest (OR = 2.87, 95% CI: 1.94-4.25) as the strongest predictors of AI acceptance. The model explained 68.4% of variance (Nagelkerke R² = 0.684) with an AUC of 0.847. Medical residents demonstrate moderate AI acceptance but significant knowledge gaps. Educational interventions, particularly hands-on learning and institutional training programs, are the strongest drivers of AI readiness, surpassing demographic predictors. Integrating AI literacy systematically into medical curricula is essential for successful AI adoption in cardiovascular care.
Author summary: Healthcare workers worldwide are increasingly encountering artificial intelligence (AI) tools in clinical settings, yet their readiness to adopt these technologies, particularly in specialized fields like cardiology, remains poorly understood, especially in rapidly developing healthcare systems. In this study, we surveyed 407 healthcare workers in Riyadh, Saudi Arabia, to understand their current attitudes, knowledge gaps, and learning preferences regarding AI in cardiac diagnosis. Our findings reveal that while most participants hold cautious optimism about AI, nearly half acknowledge lacking the knowledge needed to use it confidently. Crucially, we found that educational factors, specifically willingness to learn and interest in institutional training, were far stronger predictors of AI acceptance than demographic characteristics such as age or gender. This means that AI readiness is not a fixed trait determined by who someone is, but a teachable and trainable capacity. These results carry direct implications for medical educators and policymakers: structured, hands-on AI training integrated throughout medical curricula can meaningfully accelerate adoption of beneficial technologies in cardiovascular care and beyond.